Diamonds have been considered a visually appealing fashion statement for more than 500 years. We have been hired by a diamond wholesaler to better understand the diamond market, identify trends, and build a model that can predict diamond prices for the next year.
To provide this in-depth analysis we have been given a dataset containing sales information from 2010 to 2021.
The dataset contains 11 attributes.
Upon exploring the data we quickly realized that several cleaning steps were needed.
We dropped several null values located in the carat column.
We noticed several negative values within the cost (dollars) column. We decided to transform these into positive values; our reasoning is that they could have been human error when entering the data.
Outliers discovered within the length (mm), width (mm), height (mm) and cost (dollars) columns were also dropped.
Upon research, height (mm) contained some corrupted values that are not physically possible for a diamond; those rows were dropped.
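The filters below handle the specific problems found here (zeros and implausible heights); a more general screen for the length/width/height/cost outliers is the 1.5 × IQR rule. A minimal sketch on toy values (the series is illustrative, not from the dataset):

```python
import pandas as pd

def iqr_outlier_mask(s, k=1.5):
    """True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

heights = pd.Series([3.9, 4.0, 4.1, 4.1, 4.2, 31.0])  # 31.0 is physically impossible
print(heights[iqr_outlier_mask(heights)])  # flags only the 31.0 entry
```

The same mask could be built per column and combined before dropping rows.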
# Importing all necessary libraries
import pandas as pd
import numpy as np
import requests
import io
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing
# Loading the csv
df = pd.read_csv("/Users/clarkuniveristy/Desktop/HACK/2022-Data-Tech.Dive-main/wholesale_diamonds.csv")
# dropping N/A values
df = df.dropna()
# Changing all negative cost (dollars) values to positive values
df["cost (dollars)"]= abs(df["cost (dollars)"])
# Removing the 0 outliers in the dataset found in length (mm), width (mm), height (mm) and cost; found w/ .describe()
# roughly 180 rows
df = df.drop(df[df["cost (dollars)"] == 0].index)
df = df.drop(df[df["length (mm)"]==0].index)
df = df.drop(df[df["width (mm)"]==0].index)
df = df.drop(df[df["height (mm)"]==0].index)
df = df.drop(df[df["height (mm)"]>10].index)
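The chained drop calls above can be written equivalently as a single boolean mask, which is easier to audit. A sketch on a toy frame (values are illustrative, not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "cost (dollars)": [326, 0, 334],
    "length (mm)": [4.0, 3.9, 0.0],
    "width (mm)": [4.0, 4.1, 4.2],
    "height (mm)": [2.4, 2.5, 2.6],
})

# keep rows with strictly positive measurements and a plausible height
keep = (
    (toy["cost (dollars)"] > 0)
    & (toy["length (mm)"] > 0)
    & (toy["width (mm)"] > 0)
    & (toy["height (mm)"] > 0)
    & (toy["height (mm)"] <= 10)
)
cleaned = toy[keep]
print(len(cleaned))  # 1
```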
# Creating a new dataset for onehot coding
onehot_data = df.copy()
# Label encoding categorical attributes in the original dataset
df.cut.replace({"Ideal":5, 'Premium':4, 'Good':2, 'Very Good':3, 'Fair':1}, inplace=True)
df.color.replace({'E':2, 'I':6, 'J':7, 'H':5, 'F':3, 'G':4, 'D':1}, inplace=True)
df.clarity.replace({'SI2':1, 'SI1':2, 'VS1':3, 'VS2':4, 'VVS2':5, 'VVS1':6, 'I1':7, 'IF':8}, inplace=True)
# One hot: creating dummy variables for the cut, color, and clarity columns
dummy_cut = pd.get_dummies(onehot_data["cut"])
dummy_color = pd.get_dummies(onehot_data["color"])
dummy_clarity = pd.get_dummies(onehot_data["clarity"])
# Merging the dummy variables into the one-hot data frame
onehot_data = onehot_data.merge(dummy_cut, left_index=True, right_index=True)
onehot_data = onehot_data.merge(dummy_color, left_index=True, right_index=True)
onehot_data = onehot_data.merge(dummy_clarity, left_index=True, right_index=True)
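For reference, the three get_dummies/merge steps can be collapsed into one call by passing `columns=` to `pd.get_dummies`; note this variant prefixes the dummy names (e.g. cut_Ideal) rather than keeping the raw category labels as above. A minimal sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"cut": ["Ideal", "Good"], "color": ["E", "J"], "carat": [0.3, 0.4]})
# expands cut and color into dummy columns and drops the originals in one step
encoded = pd.get_dummies(toy, columns=["cut", "color"])
print(list(encoded.columns))
```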
# Finding categorical variables - alternative way
s = (df.dtypes == "object")
object_cols = list(s[s].index)
# Performing label encoding instead of one-hot, on a new dataset
label_data = df.copy()
label_encoder = preprocessing.LabelEncoder()
for col in object_cols:
    label_data[col] = label_encoder.fit_transform(label_data[col])
# Viewing all columns w/ the first 5 entries - one hot
pd.set_option('display.max_columns', None)
onehot_data.head()
| index | carat | cut | color | clarity | depth | table | cost (dollars) | length (mm) | width (mm) | height (mm) | year | Fair | Good | Ideal | Premium | Very Good | D | E | F | G | H | I | J | I1 | IF | SI1 | SI2 | VS1 | VS2 | VVS1 | VVS2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Ideal | E | SI2 | 62 | 55 | 326 | 4 | 4 | 2 | 2010 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | Good | E | VS1 | 57 | 65 | 327 | 4 | 4 | 2 | 2010 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | 0 | Premium | I | VS2 | 62 | 58 | 334 | 4 | 4 | 3 | 2010 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 3 | 0 | Good | J | SI2 | 63 | 58 | 335 | 4 | 4 | 3 | 2010 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | Very Good | J | VVS2 | 63 | 57 | 336 | 4 | 4 | 2 | 2010 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# Viewing all columns w/ the first 5 entries - label
df.head()
| index | carat | cut | color | clarity | depth | table | cost (dollars) | length (mm) | width (mm) | height (mm) | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 5 | 2 | 1 | 62 | 55 | 326 | 4 | 4 | 2 | 2010 |
| 1 | 1 | 0 | 2 | 2 | 3 | 57 | 65 | 327 | 4 | 4 | 2 | 2010 |
| 2 | 2 | 0 | 4 | 6 | 4 | 62 | 58 | 334 | 4 | 4 | 3 | 2010 |
| 3 | 3 | 0 | 2 | 7 | 1 | 63 | 58 | 335 | 4 | 4 | 3 | 2010 |
| 4 | 4 | 0 | 3 | 7 | 5 | 63 | 57 | 336 | 4 | 4 | 2 | 2010 |
After cleaning the dataset we are left with 405,085 observations. Below are seven analyses that will answer questions like:
How much do diamonds cost on average? What are the variance and distribution of prices?
How many diamonds of each color and clarity are there?
How many diamonds are there for each interaction of cut and color?
What are the summary statistics of diamonds for each type of cut? What has the market makeup based on cut been these past 10 years?
• How does diamond cost vary with carat, year, color, and other properties? • Correlations between the variables. • Identify trends. • Clustering: use an off-the-shelf algorithm to see if the diamonds in the dataset can be naturally grouped into clusters.
#1
# Getting the unique identifiers for the color attribute
color_uq = np.unique(onehot_data['color'])
# Getting the unique identifiers for the clarity attribute
clarity_uq = np.unique(onehot_data['clarity'])
# Getting the unique identifiers for the year attribute
year_uq = np.unique(onehot_data['year'])
# Getting the unique identifiers for the cut attribute
cut_uq = np.unique(onehot_data['cut'])
print("The unique identifiers of the categorical attributes:", "\n color:\t", color_uq, "\n clarity:", clarity_uq, "\n year:\t", year_uq, "\n cut:\t", cut_uq)
print("\n\n\n A statistical summary of the numerical attributes")
df[["carat", "depth", "table", "cost (dollars)", "length (mm)", "width (mm)", "height (mm)"]].describe()
The unique identifiers of the categorical attributes: color: ['D' 'E' 'F' 'G' 'H' 'I' 'J'] clarity: ['I1' 'IF' 'SI1' 'SI2' 'VS1' 'VS2' 'VVS1' 'VVS2'] year: [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021] cut: ['Fair' 'Good' 'Ideal' 'Premium' 'Very Good'] A statistical summary of the numerical attributes
| carat | depth | table | cost (dollars) | length (mm) | width (mm) | height (mm) | |
|---|---|---|---|---|---|---|---|
| count | 405085.000000 | 405085.000000 | 405085.000000 | 405085.000000 | 405085.000000 | 405085.000000 | 405085.000000 |
| mean | 0.797578 | 61.747658 | 57.457164 | 4377.043633 | 5.730504 | 5.732525 | 3.538812 |
| std | 0.474646 | 1.434013 | 2.239784 | 4497.973368 | 1.121080 | 1.112819 | 0.692617 |
| min | 0.200000 | 43.000000 | 43.000000 | 1.000000 | 3.730000 | 3.680000 | 1.070000 |
| 25% | 0.400000 | 61.000000 | 56.000000 | 1042.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 0.700000 | 61.800000 | 57.000000 | 2654.000000 | 5.690000 | 5.710000 | 3.520000 |
| 75% | 1.040000 | 62.500000 | 59.000000 | 5960.000000 | 6.530000 | 6.530000 | 4.030000 |
| max | 4.130000 | 79.000000 | 95.000000 | 26930.000000 | 10.140000 | 10.100000 | 6.430000 |
Fun fact: based on research and the cost (dollars) attribute we can clearly detect possible customer groups within the dataset that may be worth highlighting.
# 2
import plotly.express as px
color_c = px.colors.sequential.Burg
fig = px.histogram(onehot_data, x='color', color='color', template='plotly_white', title='color count', color_discrete_sequence=color_c, category_orders={'color': color_uq})
fig.update_layout(height=400, width=600)
fig = px.histogram(onehot_data, x='clarity', color='clarity', template='plotly_white', title='clarity count', color_discrete_sequence=color_c, category_orders={'clarity': clarity_uq})
fig.update_layout(height=400, width=600)
fig = px.histogram(onehot_data, x='cut', color='color', template='plotly_white', title='color x cut count', color_discrete_sequence=color_c, category_orders={'cut': cut_uq})
fig.update_layout(height=400, width=600)
fig = px.box(onehot_data, x='cost (dollars)', color='cut', template='plotly_white', log_x=True, title='Cost distribution by cut (log scale)', color_discrete_sequence=color_c, category_orders={'cut': cut_uq})
fig.update_layout(height=400, width=600)
fig = px.box(onehot_data, x='cost (dollars)', color='clarity', template='plotly_white', log_x=True, title='Cost distribution by clarity (log scale)', color_discrete_sequence=color_c, category_orders={'clarity': clarity_uq})
fig.update_layout(height=400, width=600)
# Note: the category_orders key must match the column being ordered ('color' here)
fig = px.box(onehot_data, x='cost (dollars)', color='color', template='plotly_white', log_x=True, title='Cost distribution by color (log scale)', color_discrete_sequence=color_c, category_orders={'color': color_uq})
fig.update_layout(height=400, width=600)
groupdata = onehot_data.groupby("cut").count()
groupdata
x_vals = []
y_vals = []
for i in [0, 1, 4, 3, 2]:
    x_vals.append(groupdata.index[i])
    y_vals.append(groupdata.iloc[i, 0])
explode = (0, 0, 0.1, 0, 0)  # only "explode" the third slice (i.e. 'Very Good')
plt.figure(figsize = [8,8])
plt.pie(y_vals, explode=explode, labels=x_vals, autopct='%1.1f%%',
textprops={'fontsize': 16, 'fontweight' : 20, 'color' : 'Black'}, startangle=90)
# Adding and formatting title
plt.title("Distribution based on cut\n", fontdict={'fontsize': 20, 'fontweight' : 20, 'color' : 'Black'})
plt.show()
# Scatter plot matrix
sns.pairplot(label_data, diag_kind="hist")
<seaborn.axisgrid.PairGrid at 0x7f82eebfb1f0>
#1 Correlation matrix
corrmat= label_data.corr()
f, ax = plt.subplots(figsize=(12,12))
maptitle = plt.axes()
maptitle.set_title('Correlation Map')
sns.heatmap(corrmat,cmap="Pastel2",annot=True)
<AxesSubplot:title={'center':'Correlation Map'}>
At first glance the key features to look at are height, width, length, and carat, due to their high correlation with cost. However, based on external research on the diamond industry we will also include color, clarity, and cut.
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
import statsmodels.api as sm
from statsmodels.formula.api import logit
# Renaming columns (note: the 'lenght' spelling is kept, as it is used consistently below)
df.rename(columns={'length (mm)': 'lenght', 'width (mm)': 'width', 'height (mm)': 'height', 'cost (dollars)': 'price'}, inplace=True)
# Changing data types for modeling purposes
pd.options.display.float_format = '{:,.0f}'.format
df["lenght"]=df["lenght"].astype(int)
df["width"]=df["width"].astype(int)
df["height"]=df["height"].astype(int)
df["carat"]=df["carat"].astype(int)
df["depth"]=df["depth"].astype(int)
df["table"]=df["table"].astype(int)
df["clarity"]=df["clarity"].astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 405085 entries, 0 to 407279 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 405085 non-null int64 1 carat 405085 non-null int64 2 cut 405085 non-null int64 3 color 405085 non-null int64 4 clarity 405085 non-null int64 5 depth 405085 non-null int64 6 table 405085 non-null int64 7 price 405085 non-null int64 8 lenght 405085 non-null int64 9 width 405085 non-null int64 10 height 405085 non-null int64 11 year 405085 non-null int64 dtypes: int64(12) memory usage: 56.3 MB
# Identifying X and y and splitting the data
X = df[["carat", "width", "lenght","height","color","cut", "clarity"]]
Y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=25)
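With `test_size=0.2`, `train_test_split` holds out a random 20% of rows, which is where the 81,017-row test set seen later comes from (405,085 × 0.2). A toy sketch of the split behavior, with `random_state` making it reproducible:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [v * 2 for v in X]

# 80/20 split; random_state pins the shuffle so reruns give the same split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=25)
print(len(X_tr), len(X_te))  # 80 20
```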
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy
import matplotlib.ticker as ticker
import geopandas
import plotnine
from plotnine import *
import shapefile as shp
Running linear regression, decision tree, random forest models:
import sklearn.linear_model as sl
linreg = sl.LinearRegression()
linreg.fit(X_train, y_train)
LinearRegression()
print("R squared of the Linear Regression on training set: {:.2%}".format(linreg.score(X_train, y_train)))
print("R squared of the Linear Regression on test set: {:.2%}".format(linreg.score(X_test, y_test)))
R squared of the Linear Regression on training set: 82.30% R squared of the Linear Regression on test set: 82.56%
y_pred = linreg.predict(X_test)
sns.scatterplot(x=y_test , y=y_pred, color="blue")
<AxesSubplot:xlabel='price'>
import sklearn.tree as st
tree = st.DecisionTreeRegressor(random_state=25)
tree.fit(X_train, y_train)
DecisionTreeRegressor(random_state=25)
print("R squared of the Decision Tree Regressor on training set: {:.2%}".format(tree.score(X_train, y_train)))
print("R squared of the Decision Tree Regressor on test set: {:.2%}".format(tree.score(X_test, y_test)))
R squared of the Decision Tree Regressor on training set: 94.17% R squared of the Decision Tree Regressor on test set: 94.22%
y_pred1 = tree.predict(X_test)
sns.scatterplot(x=y_test , y=y_pred1, color="red")
<AxesSubplot:xlabel='price'>
import sklearn.ensemble as se
rf = se.RandomForestRegressor(n_estimators=100, random_state=25)
rf.fit(X_train, y_train)
RandomForestRegressor(random_state=25)
print("R squared of the Random Forest Regressor on training set: {:.2%}".format(rf.score(X_train, y_train)))
print("R squared of the Random Forest Regressor on test set: {:.2%}".format(rf.score(X_test, y_test)))
R squared of the Random Forest Regressor on training set: 94.17% R squared of the Random Forest Regressor on test set: 94.22%
y_pred2 = rf.predict(X_test)
sns.scatterplot(x=y_test , y=y_pred2, color="green")
<AxesSubplot:xlabel='price'>
So far the random forest and decision tree models both have the highest R squared.
Next step: evaluating each model's prediction errors to understand accuracy.
# Linear regression
d = {'true': y_test, 'predicted': y_pred}
df_lr = pd.DataFrame(data=d)
df_lr['diff'] = df_lr['predicted']-df_lr['true']
df_lr
| true | predicted | diff | |
|---|---|---|---|
| 367704 | 1830 | 1,947 | 117 |
| 102823 | 2973 | 3,146 | 173 |
| 313968 | 6750 | 7,419 | 669 |
| 399151 | 1446 | 1,107 | -339 |
| 331735 | 1481 | 896 | -585 |
| ... | ... | ... | ... |
| 392228 | 937 | -526 | -1,463 |
| 301667 | 2258 | 3,628 | 1,370 |
| 350385 | 7773 | 9,958 | 2,185 |
| 298393 | 1472 | 1,489 | 17 |
| 213248 | 7196 | 9,988 | 2,792 |
81017 rows × 3 columns
# Plotting prediction errors
sns.distplot((y_pred-y_test),bins=50, color= "blue");
/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
from sklearn import metrics
#Linear
print('MAE:', metrics.mean_absolute_error(y_test,y_pred))
print('MSE:', metrics.mean_squared_error(y_test,y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
MAE: 1260.981003123481 MSE: 3568093.138899226 RMSE: 1888.9396864111957
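For reference, the three metrics printed above reduce to simple formulas over the residuals; a minimal numpy sketch on toy values (not from the dataset):

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])
err = y_hat - y_true

mae = np.abs(err).mean()   # mean absolute error
mse = (err ** 2).mean()    # mean squared error
rmse = np.sqrt(mse)        # root of MSE, in the same units as price
print(mae, mse, rmse)
```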
print("Mean Squared Log Error of the Linear Regression on test set is {:.2%}".format(metrics.mean_squared_log_error(y_test,y_pred)))
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-124-5549577c7d45> in <module> ----> 1 print("Mean Squared Log Error of the Linear Regression on test set is {:.2%}".format(metrics.mean_squared_log_error(y_test,y_pred))) /opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs) 61 extra_args = len(args) - len(all_args) 62 if extra_args <= 0: ---> 63 return f(*args, **kwargs) 64 65 # extra_args > 0 /opt/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_regression.py in mean_squared_log_error(y_true, y_pred, sample_weight, multioutput) 411 412 if (y_true < 0).any() or (y_pred < 0).any(): --> 413 raise ValueError("Mean Squared Logarithmic Error cannot be used when " 414 "targets contain negative values.") 415 ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.
plt.hist(y_pred[y_pred < 0 ])
(array([ 12., 52., 39., 91., 158., 185., 491., 1059., 1999.,
3095.]),
array([-3123.82223366, -2812.36324427, -2500.90425487, -2189.44526548,
-1877.98627608, -1566.52728669, -1255.06829729, -943.6093079 ,
-632.1503185 , -320.69132911, -9.23233971]),
<BarContainer object of 10 artists>)
The linear regression model will be dropped due to the negative predicted price values.
# Decision tree
d = {'true': y_test, 'predicted': y_pred1}
df_lr = pd.DataFrame(data=d)
df_lr['diff'] = df_lr['predicted']-df_lr['true']
df_lr
| true | predicted | diff | |
|---|---|---|---|
| 367704 | 1830 | 2,378 | 548 |
| 102823 | 2973 | 2,396 | -577 |
| 313968 | 6750 | 6,370 | -380 |
| 399151 | 1446 | 1,037 | -409 |
| 331735 | 1481 | 1,182 | -299 |
| ... | ... | ... | ... |
| 392228 | 937 | 654 | -283 |
| 301667 | 2258 | 2,494 | 236 |
| 350385 | 7773 | 9,725 | 1,952 |
| 298393 | 1472 | 975 | -497 |
| 213248 | 7196 | 8,789 | 1,593 |
81017 rows × 3 columns
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test,y_pred1))
print('MSE:', metrics.mean_squared_error(y_test,y_pred1))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred1)))
MAE: 643.8280204872035 MSE: 1182157.8598273674 RMSE: 1087.2708309466263
print("Mean Squared Log Error of the Decision Tree on test set is {:.2%}".format(metrics.mean_squared_log_error(y_test,y_pred1)))
Mean Squared Log Error of the Decision Tree on test set is 7.49%
Our decision tree regressor achieved a mean squared log error of 7.49% on the test set. Note that this measures error on the log-price scale, so it reflects relative rather than absolute error, not a strict ±7.49% band around the true price.
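Mean squared log error averages squared differences of log1p-transformed values, which is why it measures relative error and why it failed earlier on the linear model's negative predictions. A numpy sketch on toy values (not from the dataset):

```python
import numpy as np

y_true = np.array([100.0, 200.0])
y_hat = np.array([110.0, 180.0])

# squared difference of log1p-transformed values; undefined for negative inputs
msle = ((np.log1p(y_hat) - np.log1p(y_true)) ** 2).mean()
print(msle)
```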
# Plotting prediction errors
sns.distplot((y_pred1-y_test),bins=50, color= "blue");
/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
#checking for negative prices
plt.hist(y_pred1[y_pred1 < 0 ])
(array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]), <BarContainer object of 10 artists>)
# Random forest
d = {'true': y_test, 'predicted': y_pred2}
df_lr = pd.DataFrame(data=d)
df_lr['diff'] = df_lr['predicted']-df_lr['true']
df_lr
| true | predicted | diff | |
|---|---|---|---|
| 367704 | 1830 | 2,381 | 551 |
| 102823 | 2973 | 2,390 | -583 |
| 313968 | 6750 | 6,371 | -379 |
| 399151 | 1446 | 1,037 | -409 |
| 331735 | 1481 | 1,181 | -300 |
| ... | ... | ... | ... |
| 392228 | 937 | 654 | -283 |
| 301667 | 2258 | 2,494 | 236 |
| 350385 | 7773 | 9,721 | 1,948 |
| 298393 | 1472 | 976 | -496 |
| 213248 | 7196 | 8,782 | 1,586 |
81017 rows × 3 columns
print('MAE:', metrics.mean_absolute_error(y_test,y_pred2))
print('MSE:', metrics.mean_squared_error(y_test,y_pred2))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred2)))
MAE: 643.8495717416579 MSE: 1182132.6360726622 RMSE: 1087.2592313117705
print("Mean Squared Log Error of the Random Forest on test set is {:.2%}".format(metrics.mean_squared_log_error(y_test,y_pred2)))
Mean Squared Log Error of the Random Forest on test set is 7.49%
Our random forest regressor achieved the same 7.49% mean squared log error on the test set.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn import metrics
pipeline_kn = Pipeline([("scaler_kn", StandardScaler()),
                        ("kn_regressor", KNeighborsRegressor())])
pipeline_xgb = Pipeline([("scaler_xgb", StandardScaler()),
                         ("xgb_regressor", XGBRegressor())])
# List of all the pipelines
pipelines = [pipeline_kn,pipeline_xgb ]
# Dictionary of pipelines and model types for ease of reference
pipe_dict = {0: "KNeighbors",1: "XGBRegressor" }
# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)
cv_results_rms = []
for i, model in enumerate(pipelines):
    cv_score = cross_val_score(model, X_train, y_train, scoring="neg_root_mean_squared_error", cv=10)
    cv_results_rms.append(cv_score)
    print("%s: %f " % (pipe_dict[i], cv_score.mean()))
from xgboost import XGBRegressor
xgbr = XGBRegressor(verbosity=0)
xgbr.fit(X_train,y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=4,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=0)
print("R squared of the XGBRegressor on training set: {:.2%}".format(xgbr.score(X_train, y_train)))
print("R squared of the XGBRegressor on test set: {:.2%}".format(xgbr.score(X_test, y_test)))
R squared of the XGBRegressor on training set: 94.04% R squared of the XGBRegressor on test set: 94.15%
y_pred3 = xgbr.predict(X_test)
sns.scatterplot(x=y_test , y=y_pred3, color="purple")
<AxesSubplot:xlabel='price'>
d = {'true': y_test, 'predicted': y_pred3}
df_lr = pd.DataFrame(data=d)
df_lr['diff'] = df_lr['predicted']-df_lr['true']
df_lr
| true | predicted | diff | |
|---|---|---|---|
| 367704 | 1830 | 2,427 | 597 |
| 102823 | 2973 | 2,441 | -532 |
| 313968 | 6750 | 6,322 | -428 |
| 399151 | 1446 | 1,034 | -412 |
| 331735 | 1481 | 1,100 | -381 |
| ... | ... | ... | ... |
| 392228 | 937 | 649 | -288 |
| 301667 | 2258 | 2,501 | 243 |
| 350385 | 7773 | 9,503 | 1,730 |
| 298393 | 1472 | 967 | -505 |
| 213248 | 7196 | 8,986 | 1,790 |
81017 rows × 3 columns
print('MAE:', metrics.mean_absolute_error(y_test,y_pred3))
print('MSE:', metrics.mean_squared_error(y_test,y_pred3))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test,y_pred3)))
MAE: 651.3103238831392 MSE: 1197535.4549568368 RMSE: 1094.3196310753256
print("Mean Squared Log Error of the XGBRegressor on test set is {:.2%}".format(metrics.mean_squared_log_error(y_test,y_pred3)))
Mean Squared Log Error of the XGBRegressor on test set is 7.59%
The random forest model was chosen due to its low MSE and high R squared.
model = rf
model
RandomForestRegressor(random_state=25)
# saving the model
import pickle
pickle_out = open("classifier.pkl", mode = "wb")
pickle.dump(model, pickle_out)
pickle_out.close()
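Pickle serializes the fitted estimator so it can be reloaded later with `pickle.load` and used to predict without retraining. A minimal round-trip sketch using an in-memory buffer and a stand-in dictionary (the notebook itself pickles the fitted `rf` model to classifier.pkl):

```python
import io
import pickle

params = {"n_estimators": 100, "random_state": 25}  # stand-in for the fitted model

buf = io.BytesIO()          # in-memory buffer instead of a file on disk
pickle.dump(params, buf)
buf.seek(0)
restored = pickle.load(buf)
print(restored == params)  # True
```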
# Required preprocessing
test=pd.read_csv("/Users/clarkuniveristy/Desktop/HACK/2022-Data-Tech.Dive-main/diamonds_for_sale_2022.csv")
test =test.dropna()
# Changing all negative cost (dollars) values to positive values
#test["cost (dollars)"]= abs(test["cost (dollars)"])
# Removing the 0 outliers found in length (mm), width (mm), height (mm) and cost; found w/ .describe()
# roughly 180 rows
#test = test.drop(test[test["cost (dollars)"] == 0].index)
test = test.drop(test[test["length (mm)"]==0].index)
test = test.drop(test[test["width (mm)"]==0].index)
test = test.drop(test[test["height (mm)"]==0].index)
test = test.drop(test[test["height (mm)"]>10].index)
test.cut.replace({"Ideal":5, 'Premium':4, 'Good':2, 'Very Good':3, 'Fair':1}, inplace=True)
test.color.replace({'E':2, 'I':6, 'J':7, 'H':5, 'F':3, 'G':4, 'D':1}, inplace=True)
test.clarity.replace({'SI2':1, 'SI1':2, 'VS1':3, 'VS2':4, 'VVS2':5, 'VVS1':6, 'I1':7, 'IF':8}, inplace=True)
test.rename(columns={'length (mm)': 'lenght', 'width (mm)': 'width', 'height (mm)': 'height', 'cost (dollars)': 'price'}, inplace=True)
pd.options.display.float_format = '{:,.0f}'.format
test["lenght"]=test["lenght"].astype(int)
test["width"]=test["width"].astype(int)
test["height"]=test["height"].astype(int)
test["carat"]=test["carat"].astype(int)
test["depth"]=test["depth"].astype(int)
test["table"]=test["table"].astype(int)
The basic requirements to deploy the model are now complete!
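One deployment detail worth making explicit: the frame passed to `model.predict` must contain the training features in the same order. `reindex` enforces that. A sketch with a toy frame (the feature list, including the 'lenght' spelling, is copied from the training step; the toy values are illustrative):

```python
import pandas as pd

feature_order = ["carat", "width", "lenght", "height", "color", "cut", "clarity"]

# toy frame with the same columns in a scrambled order
toy = pd.DataFrame({c: [1] for c in reversed(feature_order)})
X_new = toy.reindex(columns=feature_order)  # restore the training column order
print(list(X_new.columns) == feature_order)  # True
```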